Breast Cancer Wisconsin (Diagnostic) Dataset

In this article, we compare several classification methods on the breast cancer dataset. Details on this dataset can be found in the Diagnostic Wisconsin Breast Cancer Database. We will apply the following classification methods and then compare their performance.

Throughout this website, there are a large number of articles that discuss these methods. Here, we will not cover them in depth and will only apply them. Interested readers are encouraged to see Statistical Learning.

As can be seen, the dataset contains 569 instances and 32 attributes. The objective of this exercise is to build a classification model that predicts the Diagnosis based on the remaining attributes. First, however, let's plot a count plot of the Diagnosis attribute.
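As a minimal sketch of this step, the snippet below loads scikit-learn's built-in copy of the dataset (which drops the ID column, leaving 30 feature columns plus the target) and tallies the two diagnosis classes; the tally is what a count plot would display:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# scikit-learn's built-in copy of the Wisconsin diagnostic dataset
# (the ID column is dropped, so 30 features plus the target remain)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["diagnosis"] = data.target  # 0 = malignant, 1 = benign

print(df.shape)                         # (569, 31)
print(df["diagnosis"].value_counts())   # class counts behind the count plot
```

With seaborn available, `sns.countplot(x="diagnosis", data=df)` would draw the same counts as a bar chart.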

Modeling

KNeighbors

The k-nearest neighbors classifier is one of the most commonly used classification techniques. Please see K-Nearest Neighbors from Statistical Learning, and this link, for more details.
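A minimal sketch of applying this classifier, assuming scikit-learn's copy of the dataset and an illustrative `n_neighbors=5` (not necessarily the tuned value used in the study):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# k-NN is distance based, so features should share a common scale
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print(f"k-NN test accuracy: {acc:.3f}")
```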

Logistic Regression

Logistic regression utilizes a logistic function for a classification model. Please see Logistic Regression from Statistical Learning, and this link for more details.
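The same train/test split can be reused with `LogisticRegression`; a sketch, again assuming the scikit-learn copy of the dataset and default regularization:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scaling helps the solver converge; max_iter raised for the same reason
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
acc = logreg.score(X_test, y_test)
print(f"logistic regression test accuracy: {acc:.3f}")
```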

PCA with Logistic Regression

We can also combine principal component analysis (PCA) for unsupervised dimensionality reduction with logistic regression for the prediction.
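A pipeline makes this combination explicit: scale, project onto the leading principal components, then classify. The choice of 10 components below is illustrative, not the value used in the study:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# PCA reduces the 30 features to 10 components before classification
pca_logreg = make_pipeline(StandardScaler(),
                           PCA(n_components=10),
                           LogisticRegression(max_iter=1000))
pca_logreg.fit(X_train, y_train)
acc = pca_logreg.score(X_test, y_test)
print(f"PCA + logistic regression test accuracy: {acc:.3f}")
```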

Decision Tree Classifier

A Decision Tree Classifier (DTC) predicts the target by learning simple decision rules inferred from the data features. See sklearn.tree.DecisionTreeClassifier for more details.
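A sketch with a depth limit to curb over-fitting; `max_depth=4` is an illustrative choice, not the tuned value:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Trees are scale-invariant, so no feature scaling is needed here
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
acc = tree.score(X_test, y_test)
print(f"decision tree test accuracy: {acc:.3f}")
```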

Support Vector Machine

A support vector machine (SVM) finds a decision boundary that maximizes the margin between the two classes, optionally after a kernel transformation. See this link for more details.
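A sketch using scikit-learn's `SVC` with an RBF kernel; the kernel and `C` value are illustrative defaults, not the tuned settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# SVMs are sensitive to feature scale, so standardize first
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
acc = svm.score(X_test, y_test)
print(f"SVM test accuracy: {acc:.3f}")
```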

Random Forest Classifier

A random forest classifier (RFC) fits several decision tree classifiers on sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. See sklearn.ensemble.RandomForestClassifier for more details.
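A sketch with an illustrative ensemble size (`n_estimators=200` is an assumption, not the study's tuned value):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Each tree sees a bootstrap sample; predictions are averaged over the forest
rfc = RandomForestClassifier(n_estimators=200, random_state=0)
rfc.fit(X_train, y_train)
acc = rfc.score(X_test, y_test)
print(f"random forest test accuracy: {acc:.3f}")
```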

Gradient Boosting Classifier

A Gradient Boosting Classifier (GBC) builds a model in successive stages, at each stage fitting a new tree against the gradient of a differentiable loss function. See sklearn.ensemble.GradientBoostingClassifier for more details.
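A sketch using scikit-learn's defaults (100 stages, deviance-style log loss); these are not the tuned settings from the study:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Each stage fits a shallow tree to the current loss gradient
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
acc = gbc.score(X_test, y_test)
print(f"gradient boosting test accuracy: {acc:.3f}")
```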

Multi-layer Perceptron Classifier (Neural Network)

This model optimizes the log-loss function using LBFGS or stochastic gradient descent. See sklearn.neural_network.MLPClassifier.
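A sketch with a single hidden layer; the layer size and iteration budget below are illustrative assumptions, not the study's architecture:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Neural networks train far more reliably on standardized features
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(50,),
                                  max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
acc = mlp.score(X_test, y_test)
print(f"MLP test accuracy: {acc:.3f}")
```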

Final Words

It seems that the Gradient Boosting Classifier performs slightly better than the rest of the classification methods in this study. All of these classifiers were tuned to perform at their best using GridSearchCV.
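The tuning step can be sketched as follows; the parameter grid here is illustrative, not the grid actually searched in the study:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Hypothetical grid: exhaustively tried with 5-fold cross-validation
param_grid = {"n_estimators": [100, 200],
              "learning_rate": [0.05, 0.1],
              "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
acc = search.score(X_test, y_test)
print(f"tuned GBC test accuracy: {acc:.3f}")
```

The same pattern applies to every classifier above: wrap the estimator (or pipeline) in `GridSearchCV` and score the refit best model on the held-out test set.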


  1. Breast Cancer Wisconsin (Diagnostic) Data Set
  2. scikit-learn K-Neighbors Classifier
  3. k-nearest neighbors algorithm
  4. scikit-learn Logistic Regression
  5. Logistic Regression
  6. scikit-learn Decision Tree Classifier
  7. Decision Tree Classifier
  8. scikit-learn Support Vector Machines
  9. Support-Vector Machine
  10. scikit-learn Random Forest Classifier
  11. Random Forest Classifier
  12. scikit-learn Gradient Boosting Classifier
  13. Gradient boosting
  14. scikit-learn Neural network models (supervised)
  15. Multilayer perceptron